Due to the number of plots, this page may take a minute to load.
This project is to analyze the online activities of Microsoft Developer Network (MSDN) Forums, as shown below, relevant to the numbers of votes, views and replies. MSDN hosts the online communities for developers to post questions, find answers and exchange ideas relevant to Microsoft and open source products and technologies.
The goal is to dvelop a model for predicting votes of a discussion thread.
MSDN Forums Home Page
I wrote a web scraping program to crawl the site for acquiring relevant data. The code was based on:
The site’s main data panel displays 20 entries, i.e. disucssion threads, per page. For 10,000 entries on 500 pages which apparently is a hard limit set by the site, the crawl process took about 50 minutes with the following Scrapy settings:
Although on page 500, the next age link remains shown and clickable. when accessing page 501 and beyond, the system consistently presnets an error message and without showing further forum data. The following is a screen capture showing what happened when accessing beyond page 500. Consequently, the number of pages to scrap was hard-coded as 500.
MSDN Forums Page 501 and Beyond
The crawl process scraped about 10 MB of data saved in a csv file. The created csv file resulted with a blank line after every data line, as shown below. Before RStudio imported the csv file, a Python routine first had removed all the blank lines.
Blank line after each data line
After eliminating the blank lines, RStudio successfully imported the file. The remining of this article highlights the follow-up effort to explore data, develop a model and examine the results.
The data frame created was with 10,000 observations of 17 variables.
# Spider's output
forums <- read.csv(file='msdnforums.csv')
dim(forums)
## [1] 10000 17
str(forums)
## 'data.frame': 10000 obs. of 17 variables:
## $ threadTitle : Factor w/ 9895 levels "'Add-AzureRmApplicationGatewayHttpListener' & 'Add-AzureRmApplicationGatewayRequestRoutingRule' doing nothing",..: 3172 8214 1339 1364 4070 5886 7654 3838 1718 6026 ...
## $ threadTitleLink: Factor w/ 9993 levels "https://social.msdn.microsoft.com/Forums/en-US/00017961-7db1-448b-b37a-752e5b399954/sp-form-integration-to-acce"| __truncated__,..: 880 1904 7108 9112 8905 231 4477 4045 9275 6501 ...
## $ threadSummary : Factor w/ 9911 levels "","' expected [D:\\home\\site\\wwwroot\\Microsoft.Bot.Sample.LuisBot.csproj]\n\nDialogs\\BasicLuisDialog.cs(110,74"| __truncated__,..: 5152 168 9904 4534 7266 3335 3511 4261 8443 1290 ...
## $ category : Factor w/ 37 levels ".NET Framework",..: 25 25 25 25 25 25 25 25 25 25 ...
## $ categoryLink : Factor w/ 37 levels "https://social.msdn.microsoft.com/Forums/en-US/home?category=apps",..: 24 24 24 24 24 24 24 24 24 24 ...
## $ subCategory : Factor w/ 148 levels ".NET Framework inside SQL Server",..: 114 114 114 114 114 114 114 114 114 114 ...
## $ subCategoryLink: Factor w/ 148 levels "https://social.msdn.microsoft.com/Forums/en-US/home?forum=accessdev",..: 98 98 98 98 98 98 98 98 98 98 ...
## $ votes : int 0 0 0 0 0 65 0 0 0 0 ...
## $ threadState : Factor w/ 4 levels "Answered","Discussion",..: 4 4 1 4 4 2 4 4 4 4 ...
## $ replyCount : int 1 5 26 1 0 86 0 6 0 0 ...
## $ viewCount : int 24 102 289 32 2 2716 6 89 9 10 ...
## $ createdByName : Factor w/ 7343 levels "'''HuuM'''","--_--",..: 6411 6956 4459 5383 7141 3692 1256 2508 4128 4403 ...
## $ createdByLink : Factor w/ 9828 levels "https://social.msdn.microsoft.com:443/profile/'''huum'''?type=forum&referrer=http://social.msdn.microsoft.com/F"| __truncated__,..: 8608 9307 5930 7168 9556 4940 1676 3342 5509 5858 ...
## $ createdByTime : Factor w/ 9329 levels "1 hour 18 minutes ago",..: 3 38 724 35 14 3498 50 9 58 65 ...
## $ lastReplyName : Factor w/ 7343 levels "'''HuuM'''","--_--",..: 6411 6956 4459 5383 7141 3692 1256 2508 4128 4403 ...
## $ lastReplyLink : Factor w/ 9828 levels "https://social.msdn.microsoft.com:443/profile/'''huum'''?type=forum&referrer=http://social.msdn.microsoft.com/F"| __truncated__,..: 8608 9307 5930 7168 9556 4940 1676 3342 5509 5858 ...
## $ lastReplyTime : Factor w/ 9329 levels "1 hour 18 minutes ago",..: 3 38 724 35 14 3498 50 9 58 65 ...
sum(is.na(forums)) # Any missing values?
## [1] 0
And a grand view took first.
The analysis is mainly on the three numberic fields. They are view count, reply count and votes. Other than two with timestamps, those non-essential fields were removed at this time.
## [1] "Data fields originally scraped"
## [1] "threadTitle" "threadTitleLink" "threadSummary"
## [4] "category" "categoryLink" "subCategory"
## [7] "subCategoryLink" "votes" "threadState"
## [10] "replyCount" "viewCount" "createdByName"
## [13] "createdByLink" "createdByTime" "lastReplyName"
## [16] "lastReplyLink" "lastReplyTime"
## [1] "Fields interested and kept"
## [1] "category" "subCategory" "threadState" "replyCount"
## [5] "viewCount" "votes" "createdByTime" "lastReplyTime"
and here were the first 10 rows of the processed dataset at this time.
| category | subCategory | threadState | replyCount | viewCount | votes | createdByTime | lastReplyTime |
|---|---|---|---|---|---|---|---|
| SQL Server | SQL Server XML | Unanswered | 1 | 24 | 0 | 1 hour 36 minutes ago | 1 hour 36 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 5 | 102 | 0 | 20 hours 44 minutes ago | 20 hours 44 minutes ago |
| SQL Server | SQL Server XML | Answered | 26 | 289 | 0 | Friday, October 19, 2018 10:01 PM | Friday, October 19, 2018 10:01 PM |
| SQL Server | SQL Server XML | Unanswered | 1 | 32 | 0 | 2 hours 15 minutes ago | 2 hours 15 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 2 | 0 | 12 minutes ago | 12 minutes ago |
| SQL Server | SQL Server XML | Discussion | 86 | 2716 | 65 | Saturday, October 28, 2017 5:43 PM | Saturday, October 28, 2017 5:43 PM |
| SQL Server | SQL Server XML | Unanswered | 0 | 6 | 0 | 26 minutes ago | 26 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 6 | 89 | 0 | 10 hours 45 minutes ago | 10 hours 45 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 9 | 0 | 38 minutes ago | 38 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 10 | 0 | 41 minutes ago | 41 minutes ago |
Although this project does not include timeseries analysis, there are two fields with timestamps remain valuable for depicting the distribution of data overtime. The next is to convert these two timestamps:
into the R datatime format. There were however some complications. For those threads created or replied on a current day, the MSDN site used a non-standard data time format. Rather than, e.g.
it recorded something like
as the following:
head(forums$createdByTime)
## [1] 1 hour 36 minutes ago 20 hours 44 minutes ago
## [3] Friday, October 19, 2018 10:01 PM 2 hours 15 minutes ago
## [5] 12 minutes ago Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM
These non-standard representations of timesamps must be converted to a standard date format for R to process.
To correct the problem, rows with these non-standard timestamps were replaced with a utc timestamp. This utc timestamp, as shown below, logged at the web-scraping applicaiton start-time, is placed in the 1st line of the output log for serving as a replacement for those rows with non-standard representations of time.
class(forums$createdByTime)
## [1] "factor"
head(forums$createdByTime)
## [1] 1 hour 36 minutes ago 20 hours 44 minutes ago
## [3] Friday, October 19, 2018 10:01 PM 2 hours 15 minutes ago
## [5] 12 minutes ago Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM
class(forums$lastReplyTime)
## [1] "factor"
head(forums$lastReplyTime)
## [1] 1 hour 36 minutes ago 20 hours 44 minutes ago
## [3] Friday, October 19, 2018 10:01 PM 2 hours 15 minutes ago
## [5] 12 minutes ago Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM
This utc timestamp in the log file was then read in and used for replacing those non-standard timestamps.
## [[1]]
## [1] 2018-10-27
## Levels: 2018-10-27
##
## [[2]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
##
## [[3]]
## [1] "Friday, October 19, 2018 10:01 PM"
##
## [[4]]
## [1] 2018-10-27
## Levels: 2018-10-27
##
## [[5]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
##
## [[6]]
## [1] "Saturday, October 28, 2017 5:43 PM"
## [[1]]
## [1] 2018-10-27
## Levels: 2018-10-27
##
## [[2]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
##
## [[3]]
## [1] "Friday, October 19, 2018 10:01 PM"
##
## [[4]]
## [1] 2018-10-27
## Levels: 2018-10-27
##
## [[5]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
##
## [[6]]
## [1] "Saturday, October 28, 2017 5:43 PM"
Now the data was prepared and ready.
An overview of the prepared dataset was the following.
First, reviewed the distribution of individual data fields overtime. The following plots are interactive and click the magify glass icon on top menu bar of the plot to zoom in for details of an observation.
One thing I’ve found interesting is that historically there seemed little evidence with churns of activities influenced by product announcements, releases, product end-of-life, cyber attacks, etc.
## viewCount replyCount votes viewCount/votes
## Min. : 2 Min. : 0.000 Min. : 0.0000 Min. :9.188
## 1st Qu.: 80 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.: Inf
## Median : 132 Median : 3.000 Median : 0.0000 Median : Inf
## Mean : 1542 Mean : 4.306 Mean : 0.7986 Mean : Inf
## 3rd Qu.: 219 3rd Qu.: 5.000 3rd Qu.: 0.0000 3rd Qu.: Inf
## Max. :485555 Max. :1815.000 Max. :425.0000 Max. : Inf
## Standard Deviation 11212.02 19.19455 6.806855 NaN
## viewCount replyCount votes viewCount/votes
## viewCount 1.0000000 0.2083569 0.3092113 NaN
## replyCount 0.2083569 1.0000000 0.6677484 NaN
## votes 0.3092113 0.6677484 1.0000000 NaN
## viewCount/votes NaN NaN NaN 1
The correlation between among the three is expected.
Took a closer view on:
A linear relationship seemed apparaent.
Move the cursor to the blank area and click as needed, if the plot does not display itself autotmatically. Drag to rotate and zoom in for details of custered observations.
The blue ones were those threadss with no or few replies, and similarily few or no votes, despite they might had had a few views. They all clustered near the edge where replyCount and votes cordinate orgins interset.Considering log-vireCount as x axis and log-votes as the y axis, the distribution of these observations indicated apparent lineraity.
The intent here is to investigate how a multiple linear regression model may fit.
\[\hat{votes} = \beta~0 + \beta~1 * viewCount + \beta~2 * replyCount\]
Here’s how this model looked:
##
## Call:
## lm(formula = v200 ~ vc200 + rc200, data = vcrcv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1657 -0.7237 0.6908 0.9338 19.1749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4799542 0.2045854 -7.234 1.02e-11 ***
## vc200 -0.0001145 0.0001396 -0.820 0.413
## rc200 0.5536765 0.0257211 21.526 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.523 on 197 degrees of freedom
## Multiple R-squared: 0.708, Adjusted R-squared: 0.705
## F-statistic: 238.8 on 2 and 197 DF, p-value: < 2.2e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4799542051 0.2045854084 -7.2339187 1.018614e-11
## vc200 -0.0001145445 0.0001396068 -0.8204799 4.129343e-01
## rc200 0.5536764976 0.0257210729 21.5261820 1.217897e-53
##
## Call:
## lm(formula = v1000 ~ vc1000 + rc1000, data = vcrcv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.868 -0.532 -0.026 0.477 43.752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.791e-01 7.107e-02 -6.741 2.65e-11 ***
## vc1000 1.355e-05 5.727e-06 2.365 0.0182 *
## rc1000 2.522e-01 7.995e-03 31.549 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.017 on 997 degrees of freedom
## Multiple R-squared: 0.5604, Adjusted R-squared: 0.5595
## F-statistic: 635.5 on 2 and 997 DF, p-value: < 2.2e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.790866e-01 7.106625e-02 -6.741409 2.650981e-11
## vc1000 1.354641e-05 5.726695e-06 2.365485 1.819705e-02
## rc1000 2.522164e-01 7.994502e-03 31.548735 4.693366e-152
##
## Call:
## lm(formula = v2000 ~ vc2000 + rc2000, data = vcrcv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.761 -0.513 -0.077 0.340 110.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.606e-01 8.204e-02 -4.395 1.16e-05 ***
## vc2000 9.996e-05 7.374e-06 13.556 < 2e-16 ***
## rc2000 2.143e-01 1.072e-02 19.996 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.162 on 1997 degrees of freedom
## Multiple R-squared: 0.3144, Adjusted R-squared: 0.3137
## F-statistic: 457.9 on 2 and 1997 DF, p-value: < 2.2e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.606044e-01 8.204045e-02 -4.395446 1.163475e-05
## vc2000 9.995965e-05 7.373948e-06 13.555785 4.113586e-40
## rc2000 2.142980e-01 1.071699e-02 19.996103 3.138031e-81
##
## Call:
## lm(formula = v ~ vc + rc, data = vcrcv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.88 -0.58 -0.13 0.20 392.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.309e-01 5.070e-02 -6.527 7.04e-11 ***
## vc 1.079e-04 4.493e-06 24.022 < 2e-16 ***
## rc 2.237e-01 2.625e-03 85.214 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.927 on 9997 degrees of freedom
## Multiple R-squared: 0.4761, Adjusted R-squared: 0.476
## F-statistic: 4543 on 2 and 9997 DF, p-value: < 2.2e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.330935282 5.070370e-02 -6.526846 7.042642e-11
## vc 0.000107943 4.493429e-06 24.022411 5.076744e-124
## rc 0.223662468 2.624725e-03 85.213687 0.000000e+00
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]